feat(q4_0): first-class Q4_0 core format + scalar kernel + SPI #648

Merged
michalharakal merged 1 commit into chore/resync-api-dumps from feature/q4_0-core-format
May 30, 2026

Conversation

@michalharakal

Contributor

First of a stacked series promoting Q4_0 (older GGML 4-bit, 18 bytes / 32 elements) from a JVM/MemSegment-only side-path to a first-class quantized format — mirroring how Q8_0 is wired — so any loader can produce it and any backend can specialize it.

Stacked on #647 (API-dump resync). Base will retarget to develop once #647 merges. The .api delta here is only the ~69 Q4_0 lines.

What's in this PR (Phase A, part 1)

  • commonMain heap type: Q4_0TensorData interface + Q4_0BlockTensorData (ByteArray-backed, PackedBlockStorage, toFloatArray()), plus TensorEncoding.Q4_0 (32 elems / 18 bytes).
  • Kernel SPI: Q4_0MatmulKernel interface + KernelProvider.matmulQ4_0() (default null) and a "Q4_0" case in supports().
  • Scalar kernel: ScalarQ4_0MatmulKernel (portable commonMain floor) wired via ScalarKernelProvider.
  • Dispatch: DefaultCpuOpsJvm lazy q4_0MatmulKernel (KernelRegistry) + is Q4_0TensorData -> branch in chooseQuantizedMatmul.
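The wiring described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: every name prefixed with `Sketch` (and `chooseKernelName`) is a hypothetical stand-in, and the real `KernelProvider`, `KernelRegistry` and `chooseQuantizedMatmul` signatures may differ. It only shows the shape of the idea: a capability method that defaults to null, a `supports()` query derived from it, a lazily resolved kernel, and a type-based dispatch branch.

```kotlin
// Hypothetical stand-ins for the SPI; not the project's actual signatures.
interface SketchQ4Kernel {
    /** out[r] = dot(dequantized row r of packedWeights, x); rows are packed Q4_0 blocks. */
    fun matmul(x: FloatArray, packedWeights: ByteArray, rows: Int, cols: Int, out: FloatArray)
}

interface SketchKernelProvider {
    val name: String

    /** Defaults to null so providers that do not specialize Q4_0 need no changes. */
    fun matmulQ4_0(): SketchQ4Kernel? = null

    /** Capability query: a format is supported only if a kernel is actually offered. */
    fun supports(format: String): Boolean = when (format) {
        "Q4_0" -> matmulQ4_0() != null
        else -> false
    }
}

class SketchRegistry(private val providers: List<SketchKernelProvider>) {
    // Resolved once, on first access; the first provider offering a kernel wins.
    val q4_0Kernel: SketchQ4Kernel? by lazy {
        providers.firstNotNullOfOrNull { it.matmulQ4_0() }
    }
}

// Hypothetical tensor-data hierarchy used only to show the dispatch branch.
sealed interface SketchTensorData
class SketchQ4Data(val bytes: ByteArray) : SketchTensorData
class SketchF32Data(val values: FloatArray) : SketchTensorData

/** Picks a matmul path from the runtime type of the weights. */
fun chooseKernelName(weights: SketchTensorData, registry: SketchRegistry): String =
    when (weights) {
        is SketchQ4Data ->
            if (registry.q4_0Kernel != null) "q4_0-kernel" else "dequant-fallback"
        is SketchF32Data -> "fp32"
    }
```

Keeping the default at null means existing providers compile unchanged, and the dispatcher can fall back when no provider specializes the format.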

Layout correctness note

Uses the canonical ggml split nibble layout (low nibbles → elements 0..15, high → 16..31; (code - 8) * d), matching DequantOps.dequantQ4_0FromBytes, not the interleaved layout that the existing JVM MemSeg dotQ4_0BlockMemSeg uses. That mismatch is the likely reason the Q4_0 MemSeg path was never exercised; PR2 reconciles the MemSeg kernel to this layout.
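For reference, a minimal sketch of decoding one 18-byte block in the split layout: a little-endian fp16 scale `d` followed by 16 packed bytes, where byte `i` carries element `i` in its low nibble and element `i + 16` in its high nibble. The function names here are illustrative assumptions and this is not the body of `DequantOps.dequantQ4_0FromBytes`.

```kotlin
import kotlin.math.pow

/** Converts IEEE 754 half-precision bits (low 16 bits of [bits]) to a Float. */
fun halfToFloat(bits: Int): Float {
    val sign = if ((bits and 0x8000) != 0) -1f else 1f
    val exp = (bits shr 10) and 0x1F
    val mant = bits and 0x3FF
    return when (exp) {
        0 -> sign * mant * (1f / 16777216f)            // zero / subnormal: mant * 2^-24
        31 -> if (mant == 0) sign * Float.POSITIVE_INFINITY else Float.NaN
        else -> sign * (1f + mant / 1024f) * 2f.pow(exp - 15)
    }
}

/**
 * Decodes one Q4_0 block (18 bytes -> 32 floats) using the split nibble layout.
 * Hypothetical helper, written for illustration only.
 */
fun dequantQ4_0Block(block: ByteArray, offset: Int, out: FloatArray, outOffset: Int) {
    val lo = block[offset].toInt() and 0xFF
    val hi = block[offset + 1].toInt() and 0xFF
    val d = halfToFloat(lo or (hi shl 8))              // per-block scale
    for (i in 0 until 16) {
        val b = block[offset + 2 + i].toInt() and 0xFF
        out[outOffset + i] = ((b and 0x0F) - 8) * d        // low nibble  -> element i
        out[outOffset + 16 + i] = ((b ushr 4) - 8) * d     // high nibble -> element i + 16
    }
}
```

An interleaved decoder would instead write the two nibbles of byte `i` to elements `2i` and `2i + 1`, which is why the two layouts cannot share a kernel without reconciliation.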

Tests

  • Q4_0TensorDataTest — pins split layout + (code-8)*scale dequant against the canonical ggml decode.
  • Q4_0MatmulDispatchTest — dispatch routes through the kernel and matches the scalar reference (single/multi-batch, dim×dim).
  • KernelProviderSupportsTest — extended for the Q4_0 capability query.
  • apiCheck green.
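To illustrate the kind of scalar reference such tests compare against, here is a hedged sketch of a fused matrix-vector product over packed Q4_0 rows. It assumes row-major weights with `cols` a multiple of 32 and the split nibble layout; the function names are hypothetical and this is not the `ScalarQ4_0MatmulKernel` source.

```kotlin
import kotlin.math.pow

/** Converts IEEE 754 half-precision bits to a Float. */
fun halfBitsToFloat(bits: Int): Float {
    val sign = if ((bits and 0x8000) != 0) -1f else 1f
    val exp = (bits shr 10) and 0x1F
    val mant = bits and 0x3FF
    return when (exp) {
        0 -> sign * mant * (1f / 16777216f)            // zero / subnormal
        31 -> if (mant == 0) sign * Float.POSITIVE_INFINITY else Float.NaN
        else -> sign * (1f + mant / 1024f) * 2f.pow(exp - 15)
    }
}

/** Dot product of one packed Q4_0 row with x, decoding nibbles on the fly. */
fun dotQ4_0Row(packed: ByteArray, rowOffset: Int, x: FloatArray, cols: Int): Float {
    var acc = 0f
    var p = rowOffset
    for (blockStart in 0 until cols step 32) {
        val scaleBits = (packed[p].toInt() and 0xFF) or ((packed[p + 1].toInt() and 0xFF) shl 8)
        val d = halfBitsToFloat(scaleBits)
        var blockSum = 0f
        for (i in 0 until 16) {
            val b = packed[p + 2 + i].toInt() and 0xFF
            blockSum += ((b and 0x0F) - 8) * x[blockStart + i]       // element i
            blockSum += ((b ushr 4) - 8) * x[blockStart + 16 + i]    // element i + 16
        }
        acc += d * blockSum                            // apply the block scale once
        p += 18                                        // advance to the next block
    }
    return acc
}

/** y = W * x for a rows x cols matrix stored as packed Q4_0 blocks, row-major. */
fun matvecQ4_0(packed: ByteArray, rows: Int, cols: Int, x: FloatArray): FloatArray {
    require(cols % 32 == 0) { "cols must be a multiple of the 32-element block size" }
    val rowBytes = cols / 32 * 18
    return FloatArray(rows) { r -> dotQ4_0Row(packed, r * rowBytes, x, cols) }
}
```

Multiplying by the scale once per block rather than once per element is a common choice for scalar kernels; a dispatch test can then assert that whatever kernel the registry resolves agrees with this kind of reference within a small tolerance.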

Follow-ups (stacked)

PR2 Panama SIMD + MemSeg reconcile · PR3 Native FFM · PR4 FP32→Q4_0 quantizer + loader policy · PR5 docs. Targeting 0.27.0.

🤖 Generated with Claude Code

Promotes Q4_0 (older GGML 4-bit, 18 bytes / 32 elements) from a
JVM/MemSegment-only side-path to a first-class quantized format that
any loader can produce and any backend can specialize, mirroring Q8_0:

- commonMain `Q4_0TensorData` interface + `Q4_0BlockTensorData` (heap,
  ByteArray-backed) with `toFloatArray()` dequant and PackedBlockStorage.
- `TensorEncoding.Q4_0` (32 elems / 18 bytes).
- `Q4_0MatmulKernel` SPI + `KernelProvider.matmulQ4_0()` (default null)
  and a `"Q4_0"` case in `supports()`.
- `ScalarQ4_0MatmulKernel` (portable commonMain floor) wired through
  `ScalarKernelProvider`.
- `DefaultCpuOpsJvm`: lazy `q4_0MatmulKernel` resolved via KernelRegistry
  + an `is Q4_0TensorData ->` branch in `chooseQuantizedMatmul`.

Uses the canonical ggml *split* nibble layout (low nibbles → elements
0..15, high → 16..31, `(code - 8) * d`) matching
`DequantOps.dequantQ4_0FromBytes` — NOT the interleaved layout the
existing JVM MemSeg `dotQ4_0BlockMemSeg` uses (that mismatch is the
likely reason the Q4_0 MemSeg path was never exercised; PR2 reconciles
it).

Tests: Q4_0TensorDataTest (layout/dequant), Q4_0MatmulDispatchTest
(scalar==dispatch), KernelProviderSupportsTest extended for Q4_0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 5ff5a36 into chore/resync-api-dumps May 30, 2026
5 checks passed
@michalharakal michalharakal deleted the feature/q4_0-core-format branch May 30, 2026 17:53